1 - Introduction

In this project, we are asked to predict hourly solar power production of KIVANC 2 GES between May 25 - June 3, 2022. Every prediction consists of 24 hours of the next day. The available data used in prediction process include production information of previous day, and independent weather variables such as temperature, relative humidity, downward shortwave radiation flux and cloud cover. Weather variables are given for 9 coordinates around solar power plant.

Firstly, it is a good idea to check our data visually before constructing prediction models. Below is the plot of the daily total production for available data to examine the general behavior of the production.

#convert hourly data to daily data
daily_data=production_with_weather[,list(total_p=sum(production),max_p=max(production),
                      mean_p=mean(production)),list(date)]

ggplot(daily_data ,aes(x=date,y=total_p)) +
  geom_line(color = "darkred") +
  labs(title = "Daily Total Production for Available Data",
       x = "Date",
       y= "Production (MWh)" ) +
  scale_x_date(date_breaks = "3 month", date_labels = "%d %b %y") +
  theme_minimal()

After checking the daily total production plot, it can be observed that the daily total productions fluctuate in consecutive days that might be due to changing weather conditions. After the interpretations, it is logical to add independent weather variables into our models.

Next plot is for daily maximum production for available data.

#Daily Maximum Production for Available Data
ggplot(daily_data ,aes(x=date,y=max_p)) + 
  geom_line(color = "darkred") +
  labs(title = "Daily Maximum Production for Available Data",
       x = "Date",
       y= "Production (MWh)" ) +
  scale_x_date(date_breaks = "3 month", date_labels = "%d %b %y") +
  theme_minimal()

In this plot, it can be seen that production capacity changes over time. Especially, after the setup of the power plant there is an increase in the capacity until summer of 2021. Toward the next winter, the capacity decreased to 35 MWh. The interesting part is that in our prediction period, the capacity started to increase again. This capacity will be used in our prediction models. We can also comment that the capacity depends on the weather conditions similar to daily total production. With this result, the production will be normalized according to capacity and added into our model. In our trials, we checked both the models with actual production and normalized production. We concluded to use normalized production in the model since it gave better statistics.

Another plot is the daily total production for a month to see weekly behaviour. The period for this plot is selected from a constant capacity time.

# daily plot
ggplot(data = daily_data[which(date=='2021-07-01'):which(date=='2021-07-28')], aes(x = date, y = total_p)) +
  geom_line(color = "darkred") +
  labs(title = "Daily Total Production Data between 01/07/21 and 28/07/21 ",
       x = "Date",
       y= "Production (MWh)" ) +
  scale_x_date(date_breaks = "7 days", date_labels = "%d %b %y") +
  theme_minimal()

After checking the, we concluded that there is no weekly seasonality. However we need to check hourly production to check whether there is daily seasonality or not, as well.

Next is the hourly production selected from the first 10 days of the available data.

copy <- production_with_weather
copy$hour <- as.numeric(copy$hour)
copy[,datetime:=ymd(date)+dhours(hour)]
copy=copy[order(datetime)]
# hourly plot
ggplot(data = copy[1:240], aes(x = datetime, y = production)) +
  geom_line(color = "darkred") +
  labs(title = "Daily Consumption Data between 01/02/21 and 10/02/21 ",
       x = "Date",
       y= "Consumption (MWh)")  

It is clear that, there is a daily seasonality in the data. This result is aligned with our intuitions coming from the existence of the daylight. With this result, we add hourly seasonality into our models. Since the locations are next to each other, we wanted check each location’s correlation with the production to check if they are similar.

#----------------Model 2-------------------
# check the correlations
vec <- as.matrix(cor(train_data[,c(4:39)], train_data$production))
mat <- matrix(vec,nrow=9)
mat <- as.data.table(mat)
colnames(mat) <- c("Cloud Cover", "Flux", "Humidity", "Temp")
mat
##    Cloud Cover      Flux   Humidity      Temp
## 1:  -0.2021970 0.6835513 -0.4287398 0.4528156
## 2:  -0.1923827 0.6838516 -0.4382425 0.4570686
## 3:  -0.1684673 0.6813497 -0.4215632 0.4493689
## 4:  -0.1916661 0.6883419 -0.4938625 0.4374663
## 5:  -0.1899290 0.6877258 -0.4889962 0.4279378
## 6:  -0.1775050 0.6826705 -0.5104948 0.4492146
## 7:  -0.1821306 0.6875003 -0.4336438 0.4082699
## 8:  -0.1888168 0.6842002 -0.4540360 0.4149490
## 9:  -0.1774078 0.6832661 -0.4282533 0.4092934

The result shows us that, each location has a similar correlation with the production. Therefore, instead of using each location in the model, we decided to use their average values as solely predictors for weather variables. In our model trials, we checked each case separately (average values and separate coordination values), and decided to continue with average values.

3 - Approach

A base model including the averages of weather variables, hourly seasonality and normalized production according to capacity is constructed according to information given above. To improve the model, some adjustments will be made in this section.

Base Model

The base model is as follows:

daily_max_production=production_with_forecast[,list(max_prod=max(production)),by=list(date)]
daily_max_production[,rolling_max:=frollapply(max_prod,30,max,na.rm=T)]

# due to the loss of information, we used the daily max for each day for the first 30 days
daily_max_production$rolling_max[1:29] = frollapply(daily_max_production$max_prod[1:29],5,max,na.rm=T)
daily_max_production$rolling_max[1:4] <- as.double(max(daily_max_production$max_prod[1:4])) 

# merge with regression data
production_with_weather_means_capacity=merge(production_with_weather_means,daily_max_production,by=c('date'))
production_with_weather_means_capacity[,normalized_production:=production/rolling_max]

# test and train phases
production_with_weather_means_capacity[,hour:=as.factor(hour)]
train_data_means_capacity=production_with_weather_means_capacity[!is.na(production)]
test_data_means_capacity=production_with_weather_means_capacity[is.na(production)]

# control phase to check the statistics of the model
control_data_means_capacity=production_with_weather_means_capacity[evaluation_period]
control_data_means_capacity[,datetime:=as.POSIXct(paste(control_data_means_capacity$date, control_data_means_capacity$hour), format="%Y-%m-%d %H")]
control_data_means_capacity=control_data_means_capacity[order(datetime)]

# for a week control phase 
control_data_1week_means_capacity=train_data_means_capacity[(nrow(train_data_means_capacity)-167):nrow(train_data_means_capacity)]
control_data_1week_means_capacity[,datetime:=as.POSIXct(paste(control_data_1week_means_capacity$date, control_data_1week_means_capacity$hour), format="%Y-%m-%d %H")]
control_data_1week_means_capacity=control_data_1week_means_capacity[order(datetime)]


# model 3 with daily max in a moving manner to normalize with means of locations
lm_model3=lm(normalized_production~.,train_data_means_capacity[,-c('date','rolling_max','production','max_prod'),with=F])
summary(lm_model3)
## 
## Call:
## lm(formula = normalized_production ~ ., data = train_data_means_capacity[, 
##     -c("date", "rolling_max", "production", "max_prod"), with = F])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90807 -0.04346 -0.00410  0.07947  0.55774 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.740e-02  7.153e-02   0.802  0.42233    
## hour1          3.854e-04  9.759e-03   0.039  0.96850    
## hour2          8.237e-04  9.760e-03   0.084  0.93274    
## hour3          1.271e-03  9.761e-03   0.130  0.89637    
## hour4          2.231e-03  9.762e-03   0.229  0.81924    
## hour5          3.931e-03  9.764e-03   0.403  0.68725    
## hour6          3.876e-02  9.765e-03   3.969 7.26e-05 ***
## hour7          2.509e-01  9.762e-03  25.700  < 2e-16 ***
## hour8          5.656e-01  9.771e-03  57.880  < 2e-16 ***
## hour9          7.087e-01  9.818e-03  72.186  < 2e-16 ***
## hour10         5.158e-01  1.134e-02  45.477  < 2e-16 ***
## hour11         4.759e-01  1.181e-02  40.288  < 2e-16 ***
## hour12         4.412e-01  1.222e-02  36.098  < 2e-16 ***
## hour13         4.027e-01  1.252e-02  32.162  < 2e-16 ***
## hour14         3.585e-01  1.267e-02  28.288  < 2e-16 ***
## hour15         3.058e-01  1.268e-02  24.122  < 2e-16 ***
## hour16         2.546e-01  1.181e-02  21.552  < 2e-16 ***
## hour17         3.099e-02  1.130e-02   2.744  0.00608 ** 
## hour18        -1.439e-01  1.083e-02 -13.285  < 2e-16 ***
## hour19        -1.736e-01  1.048e-02 -16.566  < 2e-16 ***
## hour20        -1.441e-01  1.026e-02 -14.045  < 2e-16 ***
## hour21        -1.192e-01  1.011e-02 -11.788  < 2e-16 ***
## hour22        -1.970e-04  9.760e-03  -0.020  0.98390    
## hour23        -1.320e-04  9.759e-03  -0.014  0.98921    
## cloud_mean    -1.000e-03  7.810e-05 -12.809  < 2e-16 ***
## flux_mean      6.056e-04  1.307e-05  46.345  < 2e-16 ***
## humidity_mean -1.129e-03  1.291e-04  -8.748  < 2e-16 ***
## temp_mean      6.770e-05  2.366e-04   0.286  0.77474    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1518 on 11588 degrees of freedom
## Multiple R-squared:  0.8553, Adjusted R-squared:  0.855 
## F-statistic:  2537 on 27 and 11588 DF,  p-value: < 2.2e-16
checkresiduals(lm_model3)

## 
##  Breusch-Godfrey test for serial correlation of order up to 31
## 
## data:  Residuals
## LM test = 7311.4, df = 31, p-value < 2.2e-16
lm_forecast_test=predict(lm_model3,test_data_means_capacity)
test_data_means_capacity[,forecasted3:=as.numeric(lm_forecast_test)*rolling_max]
lm_forecast_control=predict(lm_model3,control_data_means_capacity)
control_data_means_capacity[,forecasted3:=as.numeric(lm_forecast_control)*rolling_max]
# for 7 days control
lm_forecast_control=predict(lm_model3,control_data_1week_means_capacity)
control_data_1week_means_capacity[,forecasted3:=as.numeric(lm_forecast_control)*rolling_max]

# postprocess the forecast based on actual production
control_data_means_capacity[production<=0,'forecasted3':=0]
control_data_means_capacity[forecasted3<=0,'forecasted3':=0]
control_data_means_capacity[production>=max(production),'forecasted3':=max(production)]
control_data_means_capacity[forecasted3>=max(production),'forecasted3':=max(production)]

control_data_1week_means_capacity[production<=0,'forecasted3':=0]
control_data_1week_means_capacity[forecasted3<=0,'forecasted3':=0]
control_data_1week_means_capacity[production>=max(production),'forecasted3':=max(production)]
control_data_1week_means_capacity[forecasted3>=max(production),'forecasted3':=max(production)]

# update forecast table
forecast_table[,forecast_with_means_normalized:=test_data_means_capacity[date==forecast_date]$forecasted3]

In the base model’s output, it can be seen that hourly variables are statistically significant in the day time, independent variables are also significant except for temperature. The reason for an insignificant temperature might be it is already explained by other variables. The residual standard error of the base model is 0.152 and this value will be compared with other model. The adjusted R-squared value is 0.854 which can be improved. After checking the residuals, it can be observed that there is a high autocorrelation between residuals at specific lag of 24. The normality assumption also does not hold. Lastly, the stationarity assumption does not hold with respect to variance. The variance of residuals tend to be lower on the summer months.

Model Adjustment 1

To improve the model, a lag variable will be added into the model. Ideally lag at 24 would be suitable however considering the available data and prediction task, lag at 72 will be added into the model.

lag72 <- vector()
train_data_means_capacity= cbind(train_data_means_capacity, lag72)
train_data_means_capacity$lag72[1:72] = NA
train_data_means_capacity$lag72[73:nrow(train_data_means_capacity)] = train_data_means_capacity$production[1:(nrow(train_data_means_capacity)-72)]
train_data_means_capacity = train_data_means_capacity[!is.na(lag72)]

test_data_means_capacity= cbind(test_data_means_capacity, lag72)
min = nrow(train_data_means_capacity) - 71
max = nrow(train_data_means_capacity)
test_data_means_capacity$lag72 = train_data_means_capacity$production[c(min:max)]

#control_data_means_capacity <- train_data_means_capacity[evaluation_period]
control_data_means_capacity= cbind(control_data_means_capacity, lag72)
control_data_means_capacity$lag72 = train_data_means_capacity$lag72[evaluation_period - 72]

control_data_1week_means_capacity= cbind(control_data_1week_means_capacity, lag72)
control_data_1week_means_capacity$lag72 = train_data_means_capacity$lag72[(nrow(train_data_means_capacity)-167):nrow(train_data_means_capacity)]

# train with all variables
lm_model5=lm(normalized_production~.,train_data_means_capacity[,-c('date','rolling_max','production','max_prod'),with=F])
summary(lm_model5)
## 
## Call:
## lm(formula = normalized_production ~ ., data = train_data_means_capacity[, 
##     -c("date", "rolling_max", "production", "max_prod"), with = F])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85329 -0.03947 -0.00634  0.06484  0.59188 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.815e-01  6.979e-02   9.765  < 2e-16 ***
## hour1         -3.255e-04  9.251e-03  -0.035  0.97193    
## hour2         -5.458e-04  9.251e-03  -0.059  0.95296    
## hour3         -7.934e-04  9.252e-03  -0.086  0.93167    
## hour4         -5.457e-04  9.254e-03  -0.059  0.95298    
## hour5          1.863e-04  9.256e-03   0.020  0.98394    
## hour6          2.586e-02  9.264e-03   2.791  0.00526 ** 
## hour7          1.846e-01  9.440e-03  19.554  < 2e-16 ***
## hour8          4.241e-01  1.008e-02  42.070  < 2e-16 ***
## hour9          5.345e-01  1.051e-02  50.879  < 2e-16 ***
## hour10         3.741e-01  1.148e-02  32.599  < 2e-16 ***
## hour11         3.430e-01  1.182e-02  29.028  < 2e-16 ***
## hour12         3.172e-01  1.212e-02  26.182  < 2e-16 ***
## hour13         2.873e-01  1.232e-02  23.315  < 2e-16 ***
## hour14         2.541e-01  1.238e-02  20.521  < 2e-16 ***
## hour15         2.157e-01  1.230e-02  17.536  < 2e-16 ***
## hour16         1.841e-01  1.139e-02  16.171  < 2e-16 ***
## hour17         1.712e-02  1.073e-02   1.596  0.11056    
## hour18        -1.148e-01  1.031e-02 -11.137  < 2e-16 ***
## hour19        -1.361e-01  9.995e-03 -13.614  < 2e-16 ***
## hour20        -1.139e-01  9.762e-03 -11.670  < 2e-16 ***
## hour21        -9.476e-02  9.614e-03  -9.856  < 2e-16 ***
## hour22         1.613e-03  9.252e-03   0.174  0.86161    
## hour23         8.299e-04  9.251e-03   0.090  0.92852    
## cloud_mean    -1.191e-03  7.437e-05 -16.014  < 2e-16 ***
## flux_mean      5.005e-04  1.272e-05  39.346  < 2e-16 ***
## humidity_mean -9.681e-04  1.224e-04  -7.910 2.81e-15 ***
## temp_mean     -2.152e-03  2.319e-04  -9.279  < 2e-16 ***
## lag72          7.229e-03  1.990e-04  36.336  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1435 on 11515 degrees of freedom
## Multiple R-squared:  0.8712, Adjusted R-squared:  0.8709 
## F-statistic:  2781 on 28 and 11515 DF,  p-value: < 2.2e-16
checkresiduals(lm_model5)

## 
##  Breusch-Godfrey test for serial correlation of order up to 32
## 
## data:  Residuals
## LM test = 6806.5, df = 32, p-value < 2.2e-16
lm_forecast_test=predict(lm_model5,test_data_means_capacity)
test_data_means_capacity[,forecasted5:=as.numeric(lm_forecast_test)*rolling_max]
lm_forecast_control=predict(lm_model5,control_data_means_capacity)
control_data_means_capacity[,forecasted5:=as.numeric(lm_forecast_control)*rolling_max]
#for 1 week
lm_forecast_control=predict(lm_model5,control_data_1week_means_capacity)
control_data_1week_means_capacity[,forecasted5:=as.numeric(lm_forecast_control)*rolling_max]

# postprocess the forecast based on actual production
control_data_means_capacity[production<=0,'forecasted5':=0]
control_data_means_capacity[forecasted5<=0,'forecasted5':=0]
control_data_means_capacity[production>=max(production),'forecasted5':=max(production)]
control_data_means_capacity[forecasted5>=max(production),'forecasted5':=max(production)]

control_data_1week_means_capacity[production<=0,'forecasted5':=0]
control_data_1week_means_capacity[forecasted5<=0,'forecasted5':=0]
control_data_1week_means_capacity[production>=max(production),'forecasted5':=max(production)]
control_data_1week_means_capacity[forecasted5>=max(production),'forecasted5':=max(production)]

# update forecast table
forecast_table[,forecast_with_means_normalized_lag72:=test_data_means_capacity[date==forecast_date]$forecasted5]

With the addition of lag72 variable, our residual standard error decreased to 0.1438 and adjusted R-squared valued increased to 0.8698. In this new model, lag72 variable is significant as expected and temperature variable became significant as well. Autocorrelation decreased to some extent however it still exist. Other assumptions which are stationarity of the variance and normality of the residual do not hold.

Model Adjustment 2

Next, in order to handle serial dependence difference of the consecutive observations will be added to the model.

train_data_means_capacity[,differ1:=train_data_means_capacity$production-shift(train_data_means_capacity$production,1)]
test_data_means_capacity[,differ1:=train_data_means_capacity$production[c(min:max)]-shift(train_data_means_capacity$production[c(min:max)],1)]
train_data_means_capacity$differ1[1] = 0
test_data_means_capacity$differ1[1] = 0

differ1 <- vector()
control_data_means_capacity <- cbind(control_data_means_capacity, differ1)
control_data_means_capacity$differ1 <- train_data_means_capacity$differ1[evaluation_period-72]

control_data_1week_means_capacity <- cbind(control_data_1week_means_capacity, differ1)
control_data_1week_means_capacity$differ1 <- train_data_means_capacity$differ1[(nrow(train_data_means_capacity)-167):nrow(train_data_means_capacity)]

lm_model6=lm(normalized_production~.,train_data_means_capacity[,-c('date','rolling_max','production','max_prod'),with=F])
summary(lm_model6)
## 
## Call:
## lm(formula = normalized_production ~ ., data = train_data_means_capacity[, 
##     -c("date", "rolling_max", "production", "max_prod"), with = F])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88293 -0.03708 -0.00605  0.05824  0.55906 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.379e-01  6.286e-02  10.149  < 2e-16 ***
## hour1         -2.769e-04  8.331e-03  -0.033  0.97349    
## hour2         -4.635e-04  8.332e-03  -0.056  0.95563    
## hour3         -6.581e-04  8.332e-03  -0.079  0.93705    
## hour4         -3.970e-04  8.334e-03  -0.048  0.96201    
## hour5         -4.387e-05  8.336e-03  -0.005  0.99580    
## hour6          7.845e-03  8.350e-03   0.940  0.34745    
## hour7          7.158e-02  8.777e-03   8.156 3.82e-16 ***
## hour8          2.716e-01  9.544e-03  28.459  < 2e-16 ***
## hour9          4.792e-01  9.521e-03  50.333  < 2e-16 ***
## hour10         3.432e-01  1.035e-02  33.156  < 2e-16 ***
## hour11         3.257e-01  1.065e-02  30.585  < 2e-16 ***
## hour12         2.978e-01  1.092e-02  27.280  < 2e-16 ***
## hour13         2.672e-01  1.111e-02  24.057  < 2e-16 ***
## hour14         2.426e-01  1.115e-02  21.746  < 2e-16 ***
## hour15         2.128e-01  1.108e-02  19.211  < 2e-16 ***
## hour16         2.276e-01  1.029e-02  22.122  < 2e-16 ***
## hour17         1.281e-01  9.897e-03  12.943  < 2e-16 ***
## hour18        -2.871e-02  9.430e-03  -3.045  0.00234 ** 
## hour19        -1.246e-01  9.004e-03 -13.843  < 2e-16 ***
## hour20        -1.367e-01  8.803e-03 -15.524  < 2e-16 ***
## hour21        -1.181e-01  8.670e-03 -13.621  < 2e-16 ***
## hour22         1.420e-03  8.332e-03   0.170  0.86467    
## hour23         7.292e-04  8.331e-03   0.088  0.93026    
## cloud_mean    -1.002e-03  6.708e-05 -14.936  < 2e-16 ***
## flux_mean      6.102e-04  1.165e-05  52.376  < 2e-16 ***
## humidity_mean -1.013e-03  1.102e-04  -9.188  < 2e-16 ***
## temp_mean     -1.997e-03  2.089e-04  -9.563  < 2e-16 ***
## lag72          5.556e-03  1.821e-04  30.517  < 2e-16 ***
## differ1        1.580e-02  3.051e-04  51.804  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1292 on 11514 degrees of freedom
## Multiple R-squared:  0.8955, Adjusted R-squared:  0.8953 
## F-statistic:  3403 on 29 and 11514 DF,  p-value: < 2.2e-16
checkresiduals(lm_model6)

## 
##  Breusch-Godfrey test for serial correlation of order up to 33
## 
## data:  Residuals
## LM test = 9587.3, df = 33, p-value < 2.2e-16
lm_forecast_test=predict(lm_model6,test_data_means_capacity)
test_data_means_capacity[,forecasted6:=as.numeric(lm_forecast_test)*rolling_max]
lm_forecast_control=predict(lm_model6,control_data_means_capacity)
control_data_means_capacity[,forecasted6:=as.numeric(lm_forecast_control)*rolling_max]
# for 1 week
lm_forecast_control=predict(lm_model6,control_data_1week_means_capacity)
control_data_1week_means_capacity[,forecasted6:=as.numeric(lm_forecast_control)*rolling_max]

# postprocess the forecast based on actual production
control_data_means_capacity[production<=0,'forecasted6':=0]
control_data_means_capacity[forecasted6<=0,'forecasted6':=0]
control_data_means_capacity[production>=max(production),'forecasted6':=max(production)]
control_data_means_capacity[forecasted6>=max(production),'forecasted6':=max(production)]

control_data_1week_means_capacity[production<=0,'forecasted6':=0]
control_data_1week_means_capacity[forecasted6<=0,'forecasted6':=0]
control_data_1week_means_capacity[production>=max(production),'forecasted6':=max(production)]
control_data_1week_means_capacity[forecasted6>=max(production),'forecasted6':=max(production)]

# update forecast table
forecast_table[,forecast_with_means_normalized_lag72_dif1:=test_data_means_capacity[date==forecast_date]$forecasted6]

With addition of diff1 variable, the model’s residual standard error decresead to 0.1295 and adjusted R-squared increased to 0.8943. The diff1 variable is significant as expected. However, the addition of the diff1 variable did not handle the autocorrelation of the residual as much as we expected.

Model Adjustment 3

Next, to increase the effect of difference, diff24 will be added into the model.

train_data_means_capacity[,differ24:=train_data_means_capacity$production-shift(train_data_means_capacity$production,24)]
test_data_means_capacity[,differ24:=train_data_means_capacity$production[c(min:max)]-shift(train_data_means_capacity$production[c(min:max)],24)]
train_data_means_capacity$differ24[1:24] = 0
test_data_means_capacity$differ24[1:24] = 0

differ24 <- vector()
control_data_means_capacity <- cbind(control_data_means_capacity, differ24)
control_data_means_capacity$differ24 <- train_data_means_capacity$differ24[evaluation_period-72]

control_data_1week_means_capacity <- cbind(control_data_1week_means_capacity, differ24)
control_data_1week_means_capacity$differ24 <- train_data_means_capacity$differ24[(nrow(train_data_means_capacity)-167):nrow(train_data_means_capacity)]


lm_model7=lm(normalized_production~.,train_data_means_capacity[,-c('date','rolling_max','production','max_prod'),with=F])
summary(lm_model7)
## 
## Call:
## lm(formula = normalized_production ~ ., data = train_data_means_capacity[, 
##     -c("date", "rolling_max", "production", "max_prod"), with = F])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.64597 -0.03308 -0.00538  0.04802  0.52972 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.288e-01  5.540e-02   5.935 3.01e-09 ***
## hour1          3.959e-05  7.309e-03   0.005   0.9957    
## hour2          1.288e-04  7.310e-03   0.018   0.9859    
## hour3          2.461e-04  7.310e-03   0.034   0.9731    
## hour4          7.420e-04  7.311e-03   0.101   0.9192    
## hour5          1.499e-03  7.313e-03   0.205   0.8376    
## hour6          1.329e-02  7.326e-03   1.813   0.0698 .  
## hour7          9.665e-02  7.712e-03  12.532  < 2e-16 ***
## hour8          3.037e-01  8.391e-03  36.198  < 2e-16 ***
## hour9          4.896e-01  8.355e-03  58.600  < 2e-16 ***
## hour10         3.704e-01  9.094e-03  40.728  < 2e-16 ***
## hour11         3.529e-01  9.353e-03  37.729  < 2e-16 ***
## hour12         3.279e-01  9.592e-03  34.184  < 2e-16 ***
## hour13         2.989e-01  9.758e-03  30.627  < 2e-16 ***
## hour14         2.729e-01  9.800e-03  27.852  < 2e-16 ***
## hour15         2.410e-01  9.730e-03  24.768  < 2e-16 ***
## hour16         2.371e-01  9.027e-03  26.270  < 2e-16 ***
## hour17         1.199e-01  8.684e-03  13.805  < 2e-16 ***
## hour18        -3.344e-02  8.274e-03  -4.042 5.33e-05 ***
## hour19        -1.147e-01  7.901e-03 -14.523  < 2e-16 ***
## hour20        -1.206e-01  7.728e-03 -15.611  < 2e-16 ***
## hour21        -1.035e-01  7.611e-03 -13.604  < 2e-16 ***
## hour22         5.008e-04  7.310e-03   0.069   0.9454    
## hour23         2.601e-04  7.309e-03   0.036   0.9716    
## cloud_mean    -6.150e-04  5.921e-05 -10.386  < 2e-16 ***
## flux_mean      5.241e-04  1.033e-05  50.760  < 2e-16 ***
## humidity_mean -1.048e-03  9.671e-05 -10.832  < 2e-16 ***
## temp_mean     -9.192e-04  1.842e-04  -4.991 6.08e-07 ***
## lag72          6.042e-03  1.599e-04  37.775  < 2e-16 ***
## differ1        1.225e-02  2.744e-04  44.646  < 2e-16 ***
## differ24       1.085e-02  1.849e-04  58.702  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1133 on 11513 degrees of freedom
## Multiple R-squared:  0.9196, Adjusted R-squared:  0.9194 
## F-statistic:  4389 on 30 and 11513 DF,  p-value: < 2.2e-16
checkresiduals(lm_model7)

## 
##  Breusch-Godfrey test for serial correlation of order up to 34
## 
## data:  Residuals
## LM test = 8908, df = 34, p-value < 2.2e-16
lm_forecast_test=predict(lm_model7,test_data_means_capacity)
test_data_means_capacity[,forecasted7:=as.numeric(lm_forecast_test)*rolling_max]
lm_forecast_control=predict(lm_model7,control_data_means_capacity)
control_data_means_capacity[,forecasted7:=as.numeric(lm_forecast_control)*rolling_max]
# for 1 week
lm_forecast_control=predict(lm_model7,control_data_1week_means_capacity)
control_data_1week_means_capacity[,forecasted7:=as.numeric(lm_forecast_control)*rolling_max]

# postprocess the forecast based on actual production
control_data_means_capacity[production<=0,'forecasted7':=0]
control_data_means_capacity[forecasted7<=0,'forecasted7':=0]
control_data_means_capacity[production>=max(production),'forecasted7':=max(production)]
control_data_means_capacity[forecasted7>=max(production),'forecasted7':=max(production)]

control_data_1week_means_capacity[production<=0,'forecasted7':=0]
control_data_1week_means_capacity[forecasted7<=0,'forecasted7':=0]
control_data_1week_means_capacity[production>=max(production),'forecasted7':=max(production)]
control_data_1week_means_capacity[forecasted7>=max(production),'forecasted7':=max(production)]

# update forecast table
forecast_table[,forecast_with_means_normalized_lag72_dif1_dif24:=test_data_means_capacity[date==forecast_date]$forecasted7]

Diff24 variable is significant in the model. The addition of the diff24 variable improved the model with the decrease in the residual standard error and increase in the adjusted R-squared value. However, it did not handle the autocorrelation problem.

Model Adjustment 4

Next, we decided to add a dummy variable to check outliers in the model which excess the 95% significance level. The reason behind this addition is to accomplish stationarity of the variance and normality of the residuals.

# outliers exceed 95% significance level removed
train_data_means_capacity[,residuals:=NA] 
train_data_means_capacity$residuals <- lm_model7$residuals[1:(nrow(train_data_means_capacity))]

# Detecting lower and upper 5% residuals and marking them as outlier_small and outlier_great
train_data_means_capacity[!is.na(residuals), quant5:=quantile(residuals,0.05)]
train_data_means_capacity[!is.na(residuals), quant95:=quantile(residuals,0.95)]
train_data_means_capacity[,outlier_small:=as.numeric(residuals<quant5)]
train_data_means_capacity[,outlier_great:=as.numeric(residuals>quant95)]

residual <- vector()
control_data_means_capacity <- cbind(control_data_means_capacity, residuals)
control_data_means_capacity$residuals <- train_data_means_capacity$residuals[evaluation_period-72]

control_data_means_capacity[!is.na(residuals), quant5:=quantile(residuals,0.05)]
control_data_means_capacity[!is.na(residuals), quant95:=quantile(residuals,0.95)]
control_data_means_capacity[,outlier_small:=as.numeric(residuals<quant5)]
control_data_means_capacity[,outlier_great:=as.numeric(residuals>quant95)]

test_data_means_capacity[,outlier_small:=0]
test_data_means_capacity[,outlier_great:=0]

control_data_1week_means_capacity <- cbind(control_data_1week_means_capacity, residuals)
control_data_1week_means_capacity$residuals <- train_data_means_capacity$residuals[(nrow(train_data_means_capacity)-167):nrow(train_data_means_capacity)]

control_data_1week_means_capacity[!is.na(residuals), quant5:=quantile(residuals,0.05)]
control_data_1week_means_capacity[!is.na(residuals), quant95:=quantile(residuals,0.95)]
control_data_1week_means_capacity[,outlier_small:=as.numeric(residuals<quant5)]
control_data_1week_means_capacity[,outlier_great:=as.numeric(residuals>quant95)]

lm_model8=lm(normalized_production~.,train_data_means_capacity[,-c('date','rolling_max','production','max_prod','quant5', 'quant95', 'residuals'),with=F])
summary(lm_model8)
## 
## Call:
## lm(formula = normalized_production ~ ., data = train_data_means_capacity[, 
##     -c("date", "rolling_max", "production", "max_prod", "quant5", 
##         "quant95", "residuals"), with = F])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38817 -0.02540 -0.00241  0.03139  0.25225 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    2.283e-01  3.366e-02    6.782 1.24e-11 ***
## hour1         -1.507e-06  4.439e-03    0.000   0.9997    
## hour2          1.902e-05  4.440e-03    0.004   0.9966    
## hour3          7.130e-05  4.440e-03    0.016   0.9872    
## hour4          3.228e-04  4.441e-03    0.073   0.9421    
## hour5          8.568e-04  4.442e-03    0.193   0.8470    
## hour6          1.166e-02  4.450e-03    2.620   0.0088 ** 
## hour7          7.747e-02  4.692e-03   16.511  < 2e-16 ***
## hour8          3.140e-01  5.258e-03   59.728  < 2e-16 ***
## hour9          5.025e-01  5.235e-03   96.003  < 2e-16 ***
## hour10         3.625e-01  5.658e-03   64.067  < 2e-16 ***
## hour11         3.321e-01  5.808e-03   57.177  < 2e-16 ***
## hour12         3.089e-01  5.947e-03   51.930  < 2e-16 ***
## hour13         2.837e-01  6.055e-03   46.847  < 2e-16 ***
## hour14         2.616e-01  6.099e-03   42.883  < 2e-16 ***
## hour15         2.365e-01  6.087e-03   38.860  < 2e-16 ***
## hour16         2.303e-01  5.620e-03   40.987  < 2e-16 ***
## hour17         1.133e-01  5.325e-03   21.270  < 2e-16 ***
## hour18        -3.513e-02  5.035e-03   -6.978 3.17e-12 ***
## hour19        -1.142e-01  4.803e-03  -23.788  < 2e-16 ***
## hour20        -1.238e-01  4.697e-03  -26.347  < 2e-16 ***
## hour21        -1.019e-01  4.624e-03  -22.033  < 2e-16 ***
## hour22         3.101e-04  4.440e-03    0.070   0.9443    
## hour23         1.581e-04  4.439e-03    0.036   0.9716    
## cloud_mean    -2.366e-04  3.607e-05   -6.560 5.59e-11 ***
## flux_mean      5.073e-04  6.292e-06   80.632  < 2e-16 ***
## humidity_mean -6.797e-04  5.879e-05  -11.560  < 2e-16 ***
## temp_mean     -6.555e-04  1.119e-04   -5.859 4.79e-09 ***
## lag72          7.212e-03  9.895e-05   72.890  < 2e-16 ***
## differ1        1.177e-02  1.668e-04   70.547  < 2e-16 ***
## differ24       1.114e-02  1.123e-04   99.191  < 2e-16 ***
## outlier_small -3.250e-01  3.214e-03 -101.122  < 2e-16 ***
## outlier_great  2.507e-01  3.181e-03   78.809  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06884 on 11511 degrees of freedom
## Multiple R-squared:  0.9703, Adjusted R-squared:  0.9703 
## F-statistic: 1.177e+04 on 32 and 11511 DF,  p-value: < 2.2e-16
checkresiduals(lm_model8)

## 
##  Breusch-Godfrey test for serial correlation of order up to 36
## 
## data:  Residuals
## LM test = 3608.7, df = 36, p-value < 2.2e-16
lm_forecast_test=predict(lm_model8,test_data_means_capacity)
test_data_means_capacity[,forecasted8:=as.numeric(lm_forecast_test)*rolling_max]
lm_forecast_control=predict(lm_model8,control_data_means_capacity)
control_data_means_capacity[,forecasted8:=as.numeric(lm_forecast_control)*rolling_max]
# for 1 week
lm_forecast_control=predict(lm_model8,control_data_1week_means_capacity)
control_data_1week_means_capacity[,forecasted8:=as.numeric(lm_forecast_control)*rolling_max]

# postprocess the forecast based on actual production
control_data_means_capacity[production<=0,'forecasted8':=0]
control_data_means_capacity[forecasted8<=0,'forecasted8':=0]
control_data_means_capacity[production>=max(production),'forecasted8':=max(production)]
control_data_means_capacity[forecasted8>=max(production),'forecasted8':=max(production)]

control_data_1week_means_capacity[production<=0,'forecasted8':=0]
control_data_1week_means_capacity[forecasted8<=0,'forecasted8':=0]
control_data_1week_means_capacity[production>=max(production),'forecasted8':=max(production)]
control_data_1week_means_capacity[forecasted8>=max(production),'forecasted8':=max(production)]


# update forecast table
forecast_table[,forecast_with_means_normalized_lag72_dif1_dif24_outlier:=test_data_means_capacity[date==forecast_date]$forecasted8]

With addition of dummy variables which check outliers, the residual standard error decreased significantly to 0.06912, adjusted R-squared value increased to 0.9699. The dummy variables are both significant. After checking the residuals, the stationary of the variance and normality of the residuals seems better compared to previous models. Considering the all statistics, this is the best model we achieved. Therefore, this model will be used as our final model.

4 - Results

Different from the base model which contains the averages of weather variables, hourly seasonality and normalized production according to capacity, the final model additionally includes lag72, differ1, differ24, outlier_great and outlier_small variables. It is important that all additional variables are statistically significant in the model.

Thanks to added variables, our model performs better in terms of residual standard errors, adjusted R-quared value and weighted mean absolute percentage error. In the base model, the residual standard error was 0.1518 and adjusted R-squared value was 0.855. In the final model, the residual standard error decreased to 0.06884 and adjusted R-squared value increased 0.9703.

Now, lets compare the base model and final model visually:

cols <- c("forecasted" = "orange", "actual" = "blue")
ggplot() + 
  geom_line(data = control_data_means_capacity, aes(x = datetime, y = forecasted3,color = "forecasted")) +
  geom_line(data = control_data_means_capacity, aes(x = datetime, y = production,color = "actual")) +
  labs(title = "Predicted vs. Actual for Base Model",
       subtitle = "1 March 2022- 24 May 2022",
       x = "Time",
       y = "Production",
       color = "Color") +
  scale_color_manual(values = cols) +
  scale_x_datetime(breaks = "1 month",expand = c(0,0)) +
  theme_minimal()

ggplot() + 
  geom_line(data = control_data_means_capacity, aes(x = datetime, y = forecasted8,color = "forecasted")) +
  geom_line(data = control_data_means_capacity, aes(x = datetime, y = production,color = "actual")) +
  labs(title = "Predicted vs. Actual for Final Model",
       subtitle = "1 March 2022- 24 May 2022",
       x = "Time",
       y = "Production",
       color = "Color") +
  scale_color_manual(values = cols) +
  scale_x_datetime(breaks = "14 days",expand = c(0,0)) +
  theme_minimal()

After checking the two plots, it can be easily observed that our final model explains the available data way better than base model.

Another way to compare the base and final model is to check statistics such as bias, MAD and WMAPE:

#reporting accuracy
statistics <- function(actual, forecasted){
  n=length(actual)
  error = actual - forecasted
  mean = mean(actual)
  sd = sd(actual)
  bias = sum(error) / sum(actual)
  mad = sum(abs(error)) / n
  wmape = mad / mean
  l = data.frame(n, mean, sd, bias, mad, wmape)
  colnames(l) <- c("N", "Mean", "Standard Deviation", "Bias", "MAD", "WMAPE")
  return(l)
}
# First Model Statistics for evaluation period
kable(statistics(control_data_means_capacity$production, control_data_means_capacity$forecasted3),
      caption = "Statistics of Base Model for evaluation period ", align = 'c')
Statistics of Base Model for evaluation period
N Mean Standard Deviation Bias MAD WMAPE
2040 12.13835 14.68011 0.0467248 2.932501 0.2415897
# Final Model Statistics for evaluation period
kable(statistics(control_data_means_capacity$production, control_data_means_capacity$forecasted8),
      caption = "Statistics of Final Model for evaluation period", align = 'c')
Statistics of Final Model for evaluation period
N Mean Standard Deviation Bias MAD WMAPE
2040 12.13835 14.68011 0.0344347 1.259384 0.1037525

The results tell us that final model is better than base model in terms of all statistics. We managed to decrease our WMAPE result from 0.2415897 to 0.1037525.

Lastly, in order to see the better visualization of how well our final model fit the available data, we will check the actual production vs prediction for the last week of available production data.

ggplot() + 
  geom_line(data = control_data_1week_means_capacity, aes(x = datetime, y = forecasted3,color = "forecasted")) +
  geom_line(data = control_data_1week_means_capacity, aes(x = datetime, y = production,color = "actual")) +
  labs(title = "Predicted vs. Actual for Base Model",
       subtitle = "Last Week of Available Data",
       x = "Time",
       y = "Production",
       color = "Color") +
  scale_color_manual(values = cols) +
  scale_x_datetime(breaks = "24 hours",expand = c(0,0)) +
  theme_minimal()

ggplot() + 
  geom_line(data = control_data_1week_means_capacity, aes(x = datetime, y = forecasted8,color = "forecasted")) +
  geom_line(data = control_data_1week_means_capacity, aes(x = datetime, y = production,color = "actual")) +
  labs(title = "Predicted vs. Actual for Final Model",
       subtitle = "Last Week of Available Data",
       x = "Time",
       y = "Production",
       color = "Color") +
  scale_color_manual(values = cols) +
  scale_x_datetime(breaks = "24 hours",expand = c(0,0)) +
  theme_minimal()

Plots of the last available week of the data represents the model fit better than evaluation period. When the base and final model plots are compared, it can be seen that our final model fit better to the data than base model. However, it is obvious that our final model is not perfect. Even if its residual standard error and WMAPE results are improved, the plot tells us that our forecasted values are somehow unstable during the daytime which distorts the model.

5 - Conclusion and Future Work

To conclude, even though our model has improved for the forecasting period over time with the additions, it can be seen that the model can be improved further with different approaches.

Firstly, we did not go into the details of independent weather variables. We followed a data-driven approach and by taking the average values of the different locations and coordinates, we put these variables into our regression models. Therefore, a detailed analysis of these variables can be done and the model can be improved by taking these details into the consideration.

Next, we did not use ARIMA models in our models and predictions. Therefore, we did not see how such models work in this project. ARIMA models include past errors and previous observations so it can be useful in this project when it is combined with indepenent regressors.

Moreover, a different data set can be included into this project. There might be other variables to affect the solar power production such as legal regulations, special days, management operational decisions, maintenance periods etc. Therefore, such additions can make the model better. For example, in the competition phase, it can be seen that the last 3 days of the production shows an increasing trend for the production capacity. There might be reasons behind this increase.

Therefore, even though we reached a better model than th first model that is used, the model has the space for improvement.

6 - Appendices

7 - References

Panjwani, Manoj & Narejo, G.B.. (2014). Effect of humidity on the efficiency of solar cell (Photovoltaic). Int J Eng Res General Sci. 2. 499-503.

Amusan J. A. & Otokunefor E.B.(2019) The Effect of Cloud on the Output Performance of a Solar Module. IJESC. 9.2. 19665,19671.